Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis

While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a betrayal of patients’ trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients’ privacy. Developed for sensitive biomedical data, the method is patient-centric: it uses a local model to generate random new synthetic data, called “avatar data”, for each initial sensitive individual. This method is compared with two other synthetic data generation techniques (Synthpop, CT-GAN) and applied to real health data from a clinical trial and a cancer observational study, to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing the computation of additional privacy metrics. According to distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment’s effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39–0.63] vs. avatar HR = 0.40 [95% CI, 0.31–0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach.

For the WBCD dataset, we trained SVM models (70% training, 30% test) on the five features selected according to their F-scores. Supplementary Table 2 compares the classification performances of the best original and avatar models. We obtained comparable classification performance for the original and avatar datasets.

SUPPLEMENTARY FIGURE 1
The Synthpop synthetic data generation method retained the statistical value of the datasets. For the AIDS (Figure 1a) and WBCD (Figure 1b) datasets, the factor analysis of mixed data (FAMD) projection of the first two components showed that the original and Synthpop data overlapped, including the outliers. This result indicates the preservation of the structural information. For the AIDS dataset, Figure 1c compares the survival curves (Synthpop data: HR = 0.59 [95% CI, 0.46–0.76]; p = 5.24e-05). For the WBCD dataset, Figure 1d shows the F-score comparison for each cancer prediction variable. F-score computations for the Synthpop (purple) and original (orange) datasets were similar. Based on the F-scores, the predictive models selected closely matching variables, yielding comparable feature importance. These models have similar prediction performances (original: AUC = 99.46 (s.e. = 0.25) vs Synthpop: AUC = 99.24 (s.e. = 0.17)). Overall, these results suggest that Synthpop data support similar analyses. The Synthpop synthetic data lead to the same interpretations as those obtained with pseudonymized data.

Legend: (c) Survival curves were estimated with the Kaplan-Meier method and compared with the log-rank test and Cox proportional-hazards model, with a comparison between the original (plain lines) and AIDS Synthpop (dotted lines) data for arms 0 (purple lines) and 1 (red lines). P-values are computed using the Wald test. The original and Synthpop WBCD datasets were split into 70% training and 30% test sets (100 times). (d) Comparison of F-scores for each variable. Error bars represent the 95% confidence interval. SVM models were trained using five features selected by F-score. The AUC is presented for the original and Synthpop datasets. Abbreviations: FAMD: factor analysis for mixed data; AUC: area under the ROC curve; SVM: support vector machine; CI: confidence interval.

SUPPLEMENTARY FIGURE 2
The CT-GAN synthetic data generation method retained the statistical value of the datasets. For the AIDS (Figure 2a) and WBCD (Figure 2b) datasets, the factor analysis of mixed data (FAMD) projection of the first two components showed that the original and CT-GAN data overlap reasonably well. We note, however, that a substantial number of synthetic individuals were generated between the two clusters distinguishable in the projection. Figure 2c compares the survival curves calculated with the CT-GAN dataset and the original AIDS dataset. As with the original survival curves (continuous lines), the CT-GAN (dotted lines) survival curve for arm 1 shows a higher proportion of patients not reaching the primary endpoint over time than the arm 0 survival curve. The analysis of the CT-GAN data leads to the same interpretations as those obtained with the sensitive data. The main trial results remained unchanged: arm 1 was more effective than arm 0 when comparing CD4 T-cell count over time (original data: HR = 0.49 [95% CI, 0.39–0.63]; p = 9.19e-09 vs CT-GAN data: HR = 0.25 [95% CI, 0.20–0.31]; p < 2e-16). P-values are computed using the Wald test. We note, however, that the CT-GAN survival curves differ from the original ones in that they do not show asymptotic behavior at the end. For the WBCD dataset, Figure 2d shows the F-score comparison for each cancer prediction variable. F-score computations for the CT-GAN (blue) and original (orange) datasets show different behaviors. The model based on CT-GAN data gives markedly more importance to the variables Bare Nuclei and Clump Thickness. For the other variables, the results are comparable. These models have similar prediction performances (original: AUC = 99.46 (s.e. = 0.25) vs CT-GAN: AUC = 99.95 (s.e. = 0.13)). Overall, these results suggest that CT-GAN data support similar analyses. The CT-GAN synthetic data lead to interpretations close to those obtained with pseudonymized data.

Legend: Error bars represent the 95% confidence interval. SVM models were trained using five features selected by F-score. The AUC is presented for the original and CT-GAN datasets. Abbreviations: FAMD: factor analysis for mixed data; AUC: area under the ROC curve; SVM: support vector machine; CI: confidence interval; CT-GAN: conditional tabular generative adversarial network.

SUPPLEMENTARY FIGURE 3
Supplementary Figure 3 shows the conditional probability that an individual obtains a given local cloaking score in one avatarization, given the score obtained in the previous one.
The smallest conditional probabilities are evenly distributed in the first column. This shows that two successive avatarizations are unlikely to both yield a local cloaking of zero. The avatarizations are considered independent if the probability of a local cloaking of zero equals the conditional probability of a local cloaking of zero in a second avatarization, knowing that the first avatarization had a local cloaking of zero, i.e. $P(X = 0) = P(X_2 = 0 \mid X_1 = 0)$. Using 25 AIDS dataset avatarizations yields $P(X = 0) = 0.06$ and $P(X_2 = 0 \mid X_1 = 0) = 0.09$. These two probabilities are both low and close to each other; thus, the avatarization can be interpreted as quasi-independent from one simulation to the next. A distance attack leading to a correct re-identification is therefore largely unrelated to the values of the individual's data and mostly related to the stochasticity of the method.

Figure 3: Heatmap of conditional probabilities (AIDS dataset).
Legend: Each cell represents the proportion (in percent) of individuals with a local cloaking of {0, 1, 2, 3, 4 or 5 and more} at a second iteration, knowing they had a local cloaking of {0, 1, 2, 3, 4 or 5 and more} at a first iteration.
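
The independence check described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the binning helper, and the toy local cloaking values are hypothetical, and only two avatarizations are compared instead of the 25 used for the AIDS dataset.

```python
from collections import Counter

MAX_BIN = 5  # local cloaking values of 5 or more share one bin, as in the heatmap


def bin_cloaking(value):
    """Map a local cloaking value to its heatmap bin (0, 1, 2, 3, 4, or 5 and more)."""
    return min(value, MAX_BIN)


def conditional_table(first, second):
    """Return table[i][j] = P(X2 = j | X1 = i); rows without individuals are None."""
    if len(first) != len(second):
        raise ValueError("both avatarizations must cover the same individuals")
    pair_counts = Counter(
        (bin_cloaking(a), bin_cloaking(b)) for a, b in zip(first, second)
    )
    row_totals = Counter(bin_cloaking(a) for a in first)
    table = []
    for i in range(MAX_BIN + 1):
        if row_totals[i] == 0:
            table.append(None)  # no individual had this score in the first run
        else:
            table.append(
                [pair_counts[(i, j)] / row_totals[i] for j in range(MAX_BIN + 1)]
            )
    return table


def marginal_zero(values):
    """Return P(X = 0), the share of individuals with a local cloaking of zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical local cloaking values for ten individuals in two avatarizations.
first_run = [0, 0, 3, 7, 1, 0, 2, 9, 4, 1]
second_run = [0, 2, 0, 6, 1, 5, 3, 0, 4, 8]

table = conditional_table(first_run, second_run)
p_zero = marginal_zero(second_run)
p_zero_given_zero = table[0][0]
print(f"P(X = 0) = {p_zero:.2f}, P(X2 = 0 | X1 = 0) = {p_zero_given_zero:.2f}")
```

If the two printed probabilities are close, the avatarizations behave as approximately independent with respect to a local cloaking of zero, which is the criterion used in the text.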

SUPPLEMENTARY FIGURE 4
We performed 100 avatarizations of the AIDS dataset with the same parameters (k=20). For this dataset, we focused on the hazard ratio distribution for arms 1, 2, and 3 compared to arm 0. Results obtained with original data are then compared to those obtained with avatar data. We obtained a significant effect of the three treatments compared to arm 0. Arm 1 shows the best performance.

SUPPLEMENTARY FIGURE 5
We performed 100 avatarizations of the WBCD dataset with the same parameters (k=20). For this dataset, we computed the F-score distribution of each variable. Results obtained with original data are then compared to those obtained with avatar data. F-scores are distributed around the original values and the order of importance is preserved. Such results imply that both strongly discriminating and poorly discriminating variables for prediction retained their properties.

SUPPLEMENTARY FIGURE 6
Supplementary Figure 6 illustrates the DCR and NNDR metrics. The Distance to Closest Record (DCR) is the Euclidean distance between each synthetic record and its closest real neighbor. A high DCR implies a high level of privacy. The Nearest Neighbor Distance Ratio (NNDR) is the ratio between the distance to the closest and the distance to the second closest real neighbor for each synthetic record. It is bounded in [0, 1]. Higher values reveal better privacy. Low NNDR values indicate that each synthetic record can easily be associated with a unique original record, resulting in a low privacy level.
To provide a comparative framework for the DCR and NNDR metrics, the original data is divided into two sets. 70% of the dataset is used to create synthetic data. The holdout 30% is kept as original data to serve as a reference for the computation of DCR and NNDR.
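
As a minimal sketch of the two definitions above, the snippet below computes DCR and NNDR for a handful of records. The function name `dcr_and_nndr` and the toy two-dimensional points are hypothetical, and the convention of returning an NNDR of 1.0 when the two nearest real records are both at distance zero is an assumption made here to avoid a division by zero.

```python
import math


def dcr_and_nndr(synthetic, real):
    """Return one (DCR, NNDR) pair per synthetic record.

    DCR is the Euclidean distance to the closest real record.
    NNDR is that distance divided by the distance to the second closest one.
    """
    if len(real) < 2:
        raise ValueError("NNDR needs at least two real records")
    results = []
    for record in synthetic:
        distances = sorted(math.dist(record, r) for r in real)
        closest, second_closest = distances[0], distances[1]
        # Assumption: two real records at distance zero count as equally close.
        ratio = closest / second_closest if second_closest > 0 else 1.0
        results.append((closest, ratio))
    return results


# Hypothetical toy records in a two-dimensional projection.
real_records = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
synthetic_records = [(0.0, 1.0), (6.0, 8.0)]

metrics = dcr_and_nndr(synthetic_records, real_records)
for (dcr, nndr), record in zip(metrics, synthetic_records):
    print(f"record {record}: DCR = {dcr:.3f}, NNDR = {nndr:.3f}")
```

One plausible way to use the 70/30 split is to call the same function twice against the 70% training records, once with the synthetic records and once with the 30% holdout records, and to compare the two distributions. This reading of the reference computation is an interpretation, not a detail stated in the text.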

Figure 6: Visualization of DCR and NNDR
Legend: Original (black dots) and synthetic (green dots) records projected in a mathematical space for the computation of the distance to closest record (left) and nearest neighbor distance ratio (right) metrics.

SUPPLEMENTARY FIGURE 7
Supplementary Figure 7 presents two cases of local cloaking. On the left, the local cloaking differs from zero, which makes it difficult to trace back the link between an individual and their avatar. On the right, the local cloaking is zero, which makes it easy to link an individual to their avatar. The denser the local environment, the more protected the individuals. The median of all local cloaking values is indicative of the overall protection level: a higher median suggests that individuals are better protected. We consider that a local cloaking above 5 can reasonably be regarded as a sufficient level of privacy. The hidden rate is the percentage of sensitive individuals without a local cloaking of 0 (a local cloaking of 0 being the case shown on the right side of the figure), i.e. the percentage of individuals whose closest avatar is not the avatar derived from their data. Similarly, a higher hidden rate indicates that individuals are better protected. For example, a hidden rate of 90% means that if an attacker associates the most similar avatar with one individual, they have a 90% chance of being wrong. We estimate that a value higher than 90% ensures a sufficient level of privacy.
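
The two metrics can be illustrated with a short sketch. It assumes that the local cloaking of an individual is the number of other avatars lying strictly closer to that individual than their own avatar, which is consistent with the hidden rate definition given here but is not spelled out in this section. The function names and toy coordinates are hypothetical.

```python
import math
from statistics import median


def local_cloaking(originals, avatars):
    """Count, for each individual, the avatars closer to them than their own avatar.

    originals[i] is assumed to be the record from which avatars[i] was generated.
    """
    if len(originals) != len(avatars):
        raise ValueError("each individual needs exactly one avatar")
    scores = []
    for i, individual in enumerate(originals):
        own_distance = math.dist(individual, avatars[i])
        closer = sum(
            1
            for j, avatar in enumerate(avatars)
            if j != i and math.dist(individual, avatar) < own_distance
        )
        scores.append(closer)
    return scores


def hidden_rate(scores):
    """Percentage of individuals whose closest avatar is not their own."""
    return 100.0 * sum(1 for s in scores if s > 0) / len(scores)


# Hypothetical toy data: four individuals and their avatars in two dimensions.
originals = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
avatars = [(1.1, 0.0), (0.1, 0.0), (5.2, 5.0), (9.0, 9.0)]

scores = local_cloaking(originals, avatars)
print("local cloaking:", scores)
print("median local cloaking:", median(scores))
print("hidden rate (%):", hidden_rate(scores))
```

In this toy example only the third individual has a local cloaking of zero, so a distance attack that links each individual to their closest avatar would succeed for that individual alone.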